Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "ga4gh" namespace identifier registry and instructions/policy to TASC repo #32

Closed

Conversation

jb-adams
Copy link
Member

@jb-adams jb-adams commented Aug 6, 2021

This PR aims to create the documentation for using and contributing to the ga4gh namespace identifier and type prefixing system, as well as creating the initial type prefix registry itself.

  • Type prefix registry has essentially been lifted from the table in VRS 1.2.0. This can constitute the initial state of the GA4GH-wide registry.
  • The Background, Proposal, and Type Prefixes sections in the process document were taken from the Proposal for GA4GH-wide Computed Identifier Standard. These sections need to be further edited in my branch to make them less VRS-specific
  • The Type Prefix Submission Process section was adapted from @rishidev 's earlier PR on the same topic

@ahwagner @rrfreimuth @mamanambiya @jmarshall @susanfairley

Copy link
Member

@reece reece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in comments below, I recommend that a new repo be created for shared assets, config, etc., that the authoritative source for prefixes be machine-readable, and that the documentation be adjacent to that code. Therefore, this PR would move to a new repo.


To add an entry to the type prefix registry:
* Fork the TASC repository in your own user space
* In your user space, create a new branch from `ga4gh:master`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GA4GH should use a consistent default branch name across all projects. Most projects are migrating to using main rather than master for reasons discussed elsewhere. That is my preference too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree with this and certainly we should move what we can do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I will bring this up at TASC to rename it as main for at least this repo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams is there an open GH issue to track this so we can resolve this thread?

To add an entry to the type prefix registry:
* Fork the TASC repository in your own user space
* In your user space, create a new branch from `ga4gh:master`
* In the new branch, add an additional row in the markdown table in `type-prefix-registry.md` with your proposed type prefix.
Copy link
Member

@reece reece Sep 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the authoritative table should be code, preferably YAML, and not merely in documentation. That code should probably be in a new central/shared/core repo under TASC's purview, e.g., ga4gh-common. PRs to add prefixes should be in that code. And, the documentation for those should be in the same repo, adjacent to the changes.

When I originally implemented prefixes for VRS, I anticipated separating GA4GH shared components from VRS. For this reason, the VRS repo has a separate file ga4gh.yaml.

So, the most straightforward action would be to move this file to another repo and VRS clients would include that (as a subrepo or with vendoring).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. the new repo would be the authoritative GA4GH namespace registry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also strongly agree.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reece do you have any examples of how the registry would be used in a machine-readable sense? We discussed this at TASC and saw it mainly as human-readable documentation. i.e. developers of data platforms/systems would read through the registry and find GA4GH prefixes related to data models they are using, and then apply our identifier system to their objects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams not that I want to speak for @reece or @larrybabb, but I do think the intent here is to be descriptive of the repo contents, as you suggest.

It might make sense to rename this repo to do so, but I think that there are several other TASC-related concerns that are tracked here; do you want to limit the scope of the TASC repo to concern itself only with the ga4gh-common technical product?

I think @reece sees such a repo to eventually serve as a complete replacement for the GA4GH YAML in VRS in addition to other technical artifacts to be coordinated / blessed centrally by GA4GH.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams : Apologies for missing the questions two weeks ago.

As @ahwagner suggested, please see the VRS repo. The file he pointed to (ga4gh.yaml) is used to build a map of type ⇒ prefix, which is then used during identifier assignment. That is, when an object of type Allele is serialized, the prefix for that identifier is specified in the file ga4gh.yaml. In an ideal world, this mapping would be stored in exactly one place and all code and documentation would be derived from it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams : IMO, "TASC" is a group and doesn't help understand what this code is about. "config" or "globals" might be a better name. Admittedly, I know of only one file that would go there now.

Despite being a primarily Python guy these days, I do think that files of this sort should be language neutral.

Copy link
Member

@jmarshall jmarshall Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re repository naming: This is trivia. It doesn't matter what repo this prefix registry is in; what matters is that it exists. This repository currently contains information about the TASC committee and the registry of Service Info types, and the PR proposes to add a registry of prefixes. Both of these registries are administered by the TASC committee, and IMHO this ga4gh/TASC repository is a convenient place to gather them.

The place that this information ought to be located in is a GA4GH technical web site. To date, GA4GH as an organisation has not provided a technical web site for the information produced by its working groups. The rumblings are that this is finally on the way to changing. In the meantime, this repository with the PR as proposed is as good a staging area as any; IMHO “keep it simple” applies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re maps of type ⇒ prefix: VRS's ga4gh.yaml file contains entries like Allele: VA. But what does Allele mean?

In the VRS context, it specifically means VRS's Allele object as defined in your schema. That's good and concrete.

However in the (lack of) context of this general registry list, it's just a word. Different groups will inevitably mean slightly different things when they use the word. That's why IMHO this registry's entries need to consist of the reserved prefix (VA), a prose description of the intention (“An allele, encompassing blah blah”), and links to its specific incarnations in individual GA4GH protocols (link to VRS's Allele initially; eventually also links to other groups' protocols' uses of VA).

I guess this is the crux of why I don't see this registry as a machine-readable list. What would the machine-readable entries contain? We are trying to make a list of registered prefixes like VA (and collate their use in GA4GH formats and protocols); we are not trying to define an ontology of terms like Allele. It's great to have a type/prefix map in a concrete context like VRS, but I don't see that it's feasible or useful for TASC to attempt in the general context.


The computed identifier scheme proposed in the VR Specification computes identifiers from the data itself. Because identifers depend on the data, groups that independently generate the same variation will generate the same computed identifier for that entity, thereby obviating centralized identifier systems and enabling identifiers to be used in isolated settings such as clinical labs.

The computed identifier mechanism is broadly applicable and useful to the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to share tooling that manipulate identified data. In short, it provides an important consistency within the GA4GH ecosystem.
Copy link
Contributor

@andrewyatz andrewyatz Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst this has origins within the computed identifier space, is there a reason to limit this to just computed identifiers? Unambiguous is the target right? So long as that can be achieved then no dramas I think

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do think it would be limited only to computed identifiers.

However, if we were to allow one or more authorities to mint, maintain and manage curated identifiers then we would need some oversight to make sure each prefix was assigned to one and only one authority so as to avoid collissions.

I suppose it would be technically okay to do this, but there would need to be some oversight, guidelines and transparency on assuring certain prefixed namespaces are "owned" by one and only one authority.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computed identifiers would be my preference too, as curated identifiers already have a route via identifiers.org etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realistically, the way GA4GH-namespaced identifiers are managed is already tightly coupled to the associated specification, so I can see the argument to allow non-computed identifiers in this namespace. However, in the case of entities that have registered identifiers (e.g. DUO codes) it makes more sense to have a dedicated namespace for those. So I think it makes sense to limit the scope to computed identifiers as written here.

Regardless, I agree with @larrybabb that the key message here is for TASC to create and maintain oversight, guidelines, and transparency on identifier prefixes within the GA4GH namespace.

@@ -0,0 +1,10 @@
| Type Prefix | Class Name | Definition | Example | Reference |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be computationally consumable IMO. I don't have a concern with a textual representation but the primary source should be elsewhere (this might be covered by Reece's comment above)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes machine readable-readable at the minimum, human-readable form may exist in sync with this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams see GA4GH YAML in VRS for an example of the computable resource we are seeking.

Copy link
Contributor

@andrewyatz andrewyatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments inline that I think need addressing

@larrybabb
Copy link

I think we may be considering embedding the version in the namespace as well? not sure yet, but this should be considered how versioning will be handled.

| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|
| VA | Allele | ... | `ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VH | Haplotype | ... | `ga4gh:VH:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VAB | Abundance | ... | `ga4gh:VAB:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we changed Abundance to CopyNumber and I think VAB should be VCN (Alex W can verify)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larrybabb is correct. Though in fairness we haven't updated the documentation on VRS latest to reflect this. Made an issue to address this here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I wait until the VRS PR merges before I update it here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please wait for 1.2.1; I will put in PRs this week so that we may merge and release soon.

namespace/ga4gh-namespace-identifiers.md Outdated Show resolved Hide resolved
namespace/ga4gh-namespace-identifiers.md Outdated Show resolved Hide resolved

The computed identifier scheme proposed in the VR Specification computes identifiers from the data itself. Because identifers depend on the data, groups that independently generate the same variation will generate the same computed identifier for that entity, thereby obviating centralized identifier systems and enabling identifiers to be used in isolated settings such as clinical labs.

The computed identifier mechanism is broadly applicable and useful to the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to share tooling that manipulate identified data. In short, it provides an important consistency within the GA4GH ecosystem.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realistically, the way GA4GH-namespaced identifiers are managed is already tightly coupled to the associated specification, so I can see the argument to allow non-computed identifiers in this namespace. However, in the case of entities that have registered identifiers (e.g. DUO codes) it makes more sense to have a dedicated namespace for those. So I think it makes sense to limit the scope to computed identifiers as written here.

Regardless, I agree with @larrybabb that the key message here is for TASC to create and maintain oversight, guidelines, and transparency on identifier prefixes within the GA4GH namespace.

namespace/ga4gh-namespace-identifiers.md Outdated Show resolved Hide resolved
To add an entry to the type prefix registry:
* Fork the TASC repository in your own user space
* In your user space, create a new branch from `ga4gh:master`
* In the new branch, add an additional row in the markdown table in `type-prefix-registry.md` with your proposed type prefix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also strongly agree.


## Proposal

The following algorithmic processes, described in depth in the VR [Computed Identifiers](https://vrs.ga4gh.org/en/1.0/impl-guide/computed_identifiers.html#computed-identifiers) proposal, are included in this proposal by reference:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The following algorithmic processes, described in depth in the VR [Computed Identifiers](https://vrs.ga4gh.org/en/1.0/impl-guide/computed_identifiers.html#computed-identifiers) proposal, are included in this proposal by reference:
The following algorithmic processes, described in depth in VRS [Computed Identifiers](https://vrs.ga4gh.org/en/1.0/impl-guide/computed_identifiers.html#computed-identifiers), are included in this proposal by reference:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahwagner this has been updated, can you confirm and resolve?

namespace/ga4gh-namespace-identifiers.md Show resolved Hide resolved
namespace/ga4gh-namespace-identifiers.md Outdated Show resolved Hide resolved

The `digest` is computed as described above. The type_prefix is a short alphanumeric code that corresponds to the type of object being represented. If this propsal is accepted, this “type prefix map” would be administered by GA4GH. (Currently, this map is maintained in a YAML file within the vr-spec repository, but it would be relocated on approval of this proposal.)

We propose the following guidelines for type prefixes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
We propose the following guidelines for type prefixes:
These are our recommendations for type prefixes:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahwagner this has been updated, can you confirm and resolve?

| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|
| VA | Allele | ... | `ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VH | Haplotype | ... | `ga4gh:VH:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VAB | Abundance | ... | `ga4gh:VAB:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larrybabb is correct. Though in fairness we haven't updated the documentation on VRS latest to reflect this. Made an issue to address this here.

@jmarshall
Copy link
Member

jmarshall commented Nov 8, 2021

As I understand it, the primary purpose of this namespace identifier registry (in particular as it is written in this PR) is to provide a unified reference for spec authors listing the prefixes available and summarising what they mean. At least initially, IMHO machine processing would be a secondary concern; I would be interested to hear just what use cases people arguing for machine readability have in mind.

On the basis of being a listing primarily for human readers, IMHO the PR is a pretty good first cut and starting point. Rather than format the list as a table, I would suggest a document with a section for each prefix. For each item, this would start with boilerplate giving the prefix, class name(s), link to a reference spec(s); and then have space for (up to) several paragraphs of explanation of what that kind of identifier represents.

I have not yet reviewed the digest serialisation process, included from VRS by reference. On the face of it, I would be surprised if it were appropriate to limit all GA4GH namespace identifiers to this mechanism — but I am willing to be convinced.

@jb-adams jb-adams force-pushed the feature/issue-16-namespace-policy-setup branch from 83db059 to eb13691 Compare January 5, 2022 19:16
@jb-adams
Copy link
Member Author

jb-adams commented Jan 5, 2022

I think the open issues boil down to:

  1. Is the registry machine and/or human readable? What fields do we include for either format?
  2. Does the ga4gh namespace type prefix registry only pertain to computed identifiers? Or do we allow non-computed identifiers?

Regarding (1), I +1 @jmarshall 's idea for making each type prefix in the human-readable document as its own section, each with a consistent set of fields and space for longer description. I'm supportive of creating a machine-readable YAML or JSON that stays 1:1 with the human-readable documentation.

Regarding (2), my initial reaction is that we can allow non-computed identifiers within the registry, so long as we clearly indicate which prefixes use digest identifiers and which don't. But I think the VRS spec maintainers' opinions have more weight here, especially considering how they foresee the impact on downstream interoperability given either decision.

@rrfreimuth
Copy link
Collaborator

@jb-adams I think that sounds about right. I also +1 the proposal by @jmarshall .

I think my primary concern is that once defined, the prefixes are forever reserved across the GA4GH organization and must have a static definition. Therefore, anything that is adopted as a prefix must be considered permanent. In addition to providing a page to define these, we need a process to determine when a proposed prefix is mature enough to cast in stone.

@@ -0,0 +1,10 @@
| Type Prefix | Class Name | Definition | Example | Reference |
|-------------|------------|------------|---------|-----------|
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|
| SQ | Sequence | ... | `ga4gh:SQ.…` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|

(…and similarly throughout the table)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred three periods to the ellipsis character, but this probably wouldn't cause any issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three periods would obviously be confusing next to a literal period character, but either way the ellipsis is just a placeholder until the examples (if they are retained) are filled in. But the comment is a reminder that the : should be a ..

@susanfairley
Copy link
Contributor

This was discussed on today's TASC call. Some points that were touched on include:

  1. Versioning
  2. Any other existing GA4GH standards where there might be the potential for namespace conflict?
  3. Should TASC issue guidance to development groups on claiming namespaces
  4. How this relates to namespaces in the wider community, beyond GA4GH

We will be aiming to take this work forward with TASC, the VRS leads and any other interested parties and are hoping to schedule a call to focus on this topic.

@ahwagner
Copy link
Member

This is great, thanks for advancing this work within TASC. Happy to bring forward associated conversations happening in VR. Here are slides from today's VR call with a new versioning proposal, for reference in those future conversations.

@mamanambiya mamanambiya deleted the branch ga4gh:master March 7, 2022 14:03
@mamanambiya mamanambiya closed this Mar 7, 2022
@mamanambiya
Copy link
Collaborator

mamanambiya commented Mar 7, 2022

PR closed automatically because of the deletion of the master branch while changing the default branch to main. We will coordinate with the submitter to reopen this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants